Lexical ambiguity and Information Retrieval revisited

Authors

  • Julio Gonzalo
  • Anselmo Peñas
  • M. Felisa Verdejo
Abstract

A number of previous experiments on the role of lexical ambiguity in Information Retrieval are reproduced on the IR-Semcor test collection (derived from Semcor), where both queries and documents are hand-tagged with phrases, Part-Of-Speech and WordNet 1.5 senses. Our results indicate that a) Word Sense Disambiguation can be more beneficial to Information Retrieval than the experiments of Sanderson (1994) with artificially ambiguous pseudo-words suggested, b) Part-Of-Speech tagging does not seem to help improve retrieval, even if it is manually annotated, c) using phrases as indexing terms is not a good strategy if no partial credit is given to the phrase components.

1 Introduction

A major difficulty in experimenting with lexical ambiguity issues in Information Retrieval is always to differentiate the effects of the indexing and retrieval strategy being tested from the effects of tagging errors. Some examples are:

1. In (Richardson and Smeaton, 1995), a sophisticated retrieval system based on conceptual similarity resulted in a decrease of IR performance. It was not possible, however, to distinguish the effects of the strategy from the effects of automatic Word Sense Disambiguation (WSD) errors. In (Smeaton and Quigley, 1996), a similar strategy, together with a combination of manual disambiguation and very short documents (image captions), produced, however, an improvement of IR performance.

2. In (Krovetz, 1997), discriminating word senses with different Part-Of-Speech (as annotated by the Church POS tagger) also harmed retrieval efficiency. Krovetz noted that more than half of the words in a dictionary that differ in POS are related in meaning, but he could not decide whether the decrease of performance was due to the loss of such semantic relatedness or to automatic POS tagging errors.

3. In (Sanderson, 1994), the problem of discerning the effects of differentiating word senses from the effects of inaccurate disambiguation was overcome using artificially created pseudo-words (replacing, for instance, all occurrences of banana or kalashnikov with banana/kalashnikov) that could be disambiguated with 100% accuracy (substituting banana/kalashnikov back to the original term in each occurrence, either banana or kalashnikov). He found that IR processes were quite resistant to increasing degrees of lexical ambiguity, and that disambiguation harmed IR efficiency if performed with less than 90% accuracy. The question is whether real ambiguous words would behave as pseudo-words.

4. In (Schütze and Pedersen, 1995) it was shown that sense discriminations extracted from the test collections may enhance text retrieval. However, the static sense inventories in dictionaries or thesauri (such as WordNet) have not been used satisfactorily in IR. For instance, in (Voorhees, 1994), manual expansion of TREC queries with semantically related words from WordNet only produced slight improvements with the shortest queries.

In order to deal with these problems, we designed an IR test collection which is hand-annotated with Part-Of-Speech and semantic tags from WordNet 1.5. This collection was first introduced in (Gonzalo et al., 1998) and is described in Section 2. It is quite small by current IR standards (it is only slightly bigger than the TIME collection), but it offers a unique chance to analyze the behavior of semantic approaches to IR before scaling them up to TREC-size collections (where manual tagging is unfeasible).

In (Gonzalo et al., 1998), we used the manual annotations in the IR-Semcor collection to show that indexing with WordNet synsets can give significant improvements to Text Retrieval, even for large queries. Such a strategy works better than the synonymy expansion in (Voorhees, 1994), probably because it identifies synonym terms but, at the same
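
The pseudo-word setup attributed to Sanderson (1994) in item 3 can be illustrated with a minimal sketch. It is not the original experimental code: the function names (`make_pseudo_word`, `conflate`, `disambiguate`) and the `accuracy` parameter are hypothetical, and the error model, which picks a wrong member uniformly at random, is an assumption made only for illustration.

```python
import random


def make_pseudo_word(words):
    """Join the member words into one artificial ambiguous term."""
    return "/".join(words)


def conflate(tokens, words):
    """Replace every occurrence of any member word with the pseudo-word."""
    pseudo = make_pseudo_word(words)
    members = set(words)
    return [pseudo if tok in members else tok for tok in tokens]


def disambiguate(conflated, original, words, accuracy=1.0, rng=None):
    """Map each pseudo-word back to a member word.

    With probability `accuracy` the original word is restored; otherwise a
    wrong member is chosen at random (an assumed error model).
    """
    if len(words) < 2:
        raise ValueError("a pseudo-word needs at least two member words")
    rng = rng or random.Random(0)
    pseudo = make_pseudo_word(words)
    restored = []
    for tok, gold in zip(conflated, original):
        if tok != pseudo:
            restored.append(tok)
        elif rng.random() < accuracy:
            restored.append(gold)
        else:
            restored.append(rng.choice([w for w in words if w != gold]))
    return restored


tokens = ["a", "banana", "and", "a", "kalashnikov"]
pair = ["banana", "kalashnikov"]
ambiguous = conflate(tokens, pair)
perfect = disambiguate(ambiguous, tokens, pair, accuracy=1.0)
```

Because the original tokens are known, `accuracy=1.0` recovers the text exactly, while lower values inject a controlled share of disambiguation errors. Isolating the effect of errors from the effect of sense distinctions in this way is the property the passage describes.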




Publication date: 1999